Text File | 1993-08-26 | 13KB | 310 lines
K A N J I D I C
===============
Introduction
------------
Kanjidic contains comprehensive information about the Japanese kanji
characters. It is a text file currently 6,353 lines long,
with one line for each kanji in the two levels of the JIS X 0208-1983
set. (For information about this set, see Appendix 1.)
Eventually it will be upgraded to the JIS X 0208-1990 version.
The file contains a mixture of ASCII characters and kana/kanji encoded using
the EUC (Extended Unix Code) coding.
Contents & Format
-----------------
The first part of each line is of a fixed format, indicating which
character the line is for, while the rest is more free-format.
The first two bytes are the kanji itself. There is then a space, the 4-byte
ASCII representation of the hexadecimal coding of the two-byte JIS encoding,
and another space.
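
The fixed-format prefix described above can be read with a few lines of
code. The following is a minimal Python sketch, not part of kanjidic or its
tools: the function name parse_head is hypothetical, and the sample line is
made up for illustration. It relies on the fact that EUC stores a JIS X 0208
character as its two JIS bytes with the high bit of each byte set, so the
kanji bytes and the hexadecimal field can be checked against each other.

```python
def parse_head(line: bytes):
    """Split one raw EUC-encoded line into (kanji, jis_code, rest).

    Assumes the layout described above: two kanji bytes, a space,
    four ASCII hex digits, and another space.
    """
    if len(line) < 8 or line[2:3] != b" " or line[7:8] != b" ":
        raise ValueError("line does not start with the fixed-format prefix")
    kanji_bytes = line[:2]
    jis_code = int(line[3:7].decode("ascii"), 16)
    # EUC form of a JIS X 0208 character = JIS bytes with the high bits set.
    if int.from_bytes(kanji_bytes, "big") != (jis_code | 0x8080):
        raise ValueError("kanji bytes and JIS code disagree")
    return kanji_bytes.decode("euc_jp"), jis_code, line[8:]


# Illustrative line only; the field values are not taken from kanjidic.
sample = "\u4e9c 3021 B7 S7 ".encode("euc_jp")
kanji, jis_code, rest = parse_head(sample)
print(kanji, hex(jis_code), rest)
```

The remainder of the line is returned undecoded, so that the caller can
decide how to handle the free-format part.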
The rest of the line is composed of a combination of three kinds of fields
(which may be in any order and interspersed):
1) Readings (with '-' to indicate prefixes/suffixes, and '.' to separate
a reading from its okurigana). ON-yomi are in katakana, while KUN-yomi
are in hiragana.
2) English translations and/or notes. Each such field begins with an
open brace '{' and ends at the next close brace '}'.
3) Information fields, beginning with an identifying letter and ending
with a space. There are currently a variety of predefined fields
(programs using kanjidic should not make any assumptions about the
presence or absence of any of these fields, as kanjidic
is certain to be extended in the future):
B<num> -- The radical (Bushu) number. Exactly one per line.
As far as possible, this is the radical number used in
Nelson. Where the classical or historical radical number
differs from this, it is present as a separate C<num> entry.
C<num> -- The historical or classical radical number (where this
differs from the B<num> entry). There may be zero,
one or several of these.
F<num> -- The frequency-of-use ranking. At most one per line.
The 2,135 most-used characters have a ranking.
Those characters that lack this field are not ranked.
G<num> -- The Jouyou grade level. At most one per line.
G1 through G6 indicate Jouyou grades 1-6.
G8 indicates general-use characters.
G9 indicates Jinmeiyou ("for use in names") characters.
If not present, it is a kanji outside these categories.
H<num> -- The index number in Jack Halpern's dictionary.
At most one allowed per line. If not present, the
character is not in Halpern.
N<num> -- The index number in the Nelson dictionary. At most
one allowed per line. If not present, the character
is not in Nelson, or is considered to be a non-standard
version, in which case there will be {see Nnnn} appended.
P<code> -- The SK*P pattern code (similar to Halpern). The <code>
is of the form "P<num>-<num>-<num>". See Halpern for
a description of his SKIP pattern code, which is
similar to this. A brief summary of the method is in
Appendix 3.
[NB: the Pn-n-n codes have been removed from
kanjidic as of 4 August 1993. The removal has
taken place to avoid violation of Mr Halpern's
copyright of this list of codes.]
S<num> -- The stroke count. At least one per line. If more than
one, the first is considered the accepted count, while
subsequent ones are common miscounts.
U<hexnum> - Exactly one per line. The Unicode encoding of the kanji.
See Appendix 2 for further information on this.
Qnnnn.n - The "Four Corner" code for that kanji. This is a rather
old code used in China and Japan. In some cases there
are two of these codes, as it is a little ambiguous.
MNnnnnnnn and MPnn.nnnn The index number and volume.page respectively
of the kanji in the 13-volume Morohashi "DaiKanWaJiten".
Ennnn - The index number used in "A Guide To Remembering
Japanese Characters" by Kenneth G. Henshall. There
are 1945 kanji with these numbers (i.e. the Jouyou
subset).
Yxxxxx - The "PinYin" of each kanji, i.e. the (Mandarin or
Beijing) Chinese romanization. About 6,000 of the
kanji have these. Obviously the native Japanese
kokuji do not have PinYin.
(Many of the kanji also have indices for the Spahn & Hadamitzky
dictionary. At present they are encoded in the "meaning" field,
but will shortly be moved to the index region of the records.)
If the final field of a line is not an English field, the line ends with a
space. Each reading and info field is therefore bracketed by spaces (which
makes it convenient for grep-based searches).
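
As an illustration of the three kinds of fields, the free-format part of a
line can be split roughly as follows. This is a hedged Python sketch rather
than the method used by any existing kanjidic program: the names
split_fields and has_field are hypothetical, the text is assumed to have
been decoded from EUC already, and braces are assumed to be properly
paired. A token containing any non-ASCII character is treated as a reading,
so that readings with a leading '-' are not mistaken for info fields.

```python
import re

# A token is either a whole {...} English field or a run of non-spaces.
TOKEN = re.compile(r"\{[^}]*\}|\S+")


def split_fields(rest: str):
    """Sort the free-format fields into readings, meanings and info fields."""
    readings, meanings, info = [], [], []
    for tok in TOKEN.findall(rest):
        if tok.startswith("{"):
            meanings.append(tok.strip("{}"))
        elif tok.isascii():
            info.append(tok)      # e.g. B<num>, S<num>, U<hexnum>
        else:
            readings.append(tok)  # kana, possibly with '-' or '.'
    return readings, meanings, info


def has_field(line: str, field: str) -> bool:
    """grep-style test that relies on every field being space-bracketed."""
    return f" {field} " in line


# Illustrative data only; the field values are not taken from kanjidic.
rest = "B7 S7 \u30a2 \u3064.\u3050 {Asia} {rank next} "
print(split_fields(rest))
```

Because the info fields are kept as whole tokens, unknown field letters
pass through untouched, in keeping with the advice above that programs
should not assume which fields are present.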
As far as possible all entries will have their yomikata and readings
attached, even if they are a recognized variant of another kanji. This is
to facilitate electronic searches using these fields as keys, and should
not be taken as a recommendation to use such obscure kanji.
Usage
-----
Kanjidic is currently used to build the "kinfo.dat" file which is used by JDIC
and JREADER, and by Stephen Chung's JWP. "kinfo.dat" contains the identical
information, but in a compressed form and in a structure suitable for fast
indexed access.
Kanjidic is also used in the XJDIC program.
Support
-------
Kanjidic was originally compiled, and is maintained, by:
Jim Breen
(jwb@capek.rdt.monash.edu.au)
Department of Robotics & Digital Technology
Monash University, Victoria, Australia
If you have corrections, send them to him as diffs [not complete files].
Too Much Information?
---------------------
Kanjidic is now rather large, and contains information which is of little
use to people who are not studying or researching Japanese orthography.
It is still appropriate to maintain it as a useful compendium of such
information in the Public Domain.
For people who only wish to use a subset of the information in kanjidic,
there is a program "kdfilt.c", also available as kdfilt.exe for MS-DOS,
which will strip out unwanted fields.
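
The idea of stripping unwanted fields can be sketched as follows. This
Python fragment is NOT kdfilt and makes no claim about how kdfilt.c works
or what options it takes: the function strip_fields and its drop_letters
parameter are hypothetical, and the line is assumed to have been decoded
from EUC already. It drops every info field whose identifying letter is in
drop_letters and leaves the kanji, the JIS code, the readings and the
English fields alone.

```python
import re

# A token is either a whole {...} English field or a run of non-spaces.
TOKEN = re.compile(r"\{[^}]*\}|\S+")


def strip_fields(line: str, drop_letters: str) -> str:
    """Return the line without info fields starting with any given letter."""
    tokens = TOKEN.findall(line)
    if not tokens:
        return line
    kept = tokens[:2]  # the kanji and its JIS code are always kept
    for tok in tokens[2:]:
        is_info = tok.isascii() and not tok.startswith("{")
        if is_info and tok[0] in drop_letters:
            continue
        kept.append(tok)
    # Keep the convention of a final space unless the line ends in {...}.
    tail = "" if kept[-1].startswith("{") else " "
    return " ".join(kept) + tail


# Illustrative line only; the field values are not taken from kanjidic.
line = "\u4e9c 3021 U4e9c B7 S7 Q1010.6 Yya4 \u30a2 {Asia}"
print(strip_fields(line, "QY"))
```

Filtering by identifying letter means that fields added to kanjidic in the
future are kept unless they are explicitly named.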
History (comments by Jim Breen)
-------
Kanjidic began as two files: jis1detl.lst and jis2detl.lst.
The first file was compiled initially from the file "kinfo.dat" supplied by
Stephen Chung, who in turn compiled his file from a file prepared by Mike
Erickson. I originally added about 1900 "meanings" by James Heisig keyed in by
Kevin Moore from the book "Remembering The Kanji". I later added the ex-Nelson
meanings from Rik Smoody's files, compiled when he was working for Sony in
Japan.
The second file was compiled from a complete JIS2 list with Bushu and stroke
counts kindly supplied to me by Jon Crossley, to which I added Nelson numbers,
yomikata and meanings extracted from a dictionary file prepared by Rik Smoody
at Sony.
The file is being continually updated with extra and corrected yomikata,
Nelson nos, meanings, etc. Theresa Martin has been of great assistance with
this, particularly with tracking down and correcting many mistranscribed
yomikata (the old zu/dzu, oo/ou, ji/dji, etc. problems).
Jeffrey Friedl did a major overhaul in September-October 1992, in which he
added frequency rankings, Halpern codes and SK*P patterns, updated the grading
("G" fields) to reflect the modern Jouyou lists, and corrected radical numbers,
stroke counts and readings to fall in line with modern usage.
Magnus Halldorsson corrected some erroneous Halpern numbers, and provided
them for a lot of the radicals.
Lee Collins provided the Unicode mappings (see Appendix 2).
Iain Sinclair has provided the yomikata, meanings and S&H indices of many of
the obscure JIS2 kanji.
Christian Wittern, a Sinologist working at Kyouto U, sent me a monster file
prepared and released by Dr Urs App from Hanazono University. From this
I have extracted the "Four Corner", Morohashi and PinYin information. I am
very grateful for this significant contribution.
Alfredo Pinochet supplied all the Henshall numbers.
In July 1993, aft